Census Data and tidycensus

In this tutorial, we will work through several exercises using the tidycensus R package to fetch, wrangle, and map census data.

The key tidycensus functions we will use today are:

  • census_api_key: makes your Census API key available to tidycensus

  • load_variables: retrieves a dataframe of available census data variables

  • get_decennial: fetch census data from a recent decennial censuses - 2000, 2010 (and soon 2020)

  • get_acs: fetch census data from an ACS (American Community Survey) 1 or 5 year dataset, 2005 - 2020.

Setup

Be sure to clone or downloaded and unzip the workshop files from: https://github.com/dlab-berkeley/Census-Data-in-R

Then:

  1. Open the folder with the workshop files

  2. Double-click on the R Project file Census-Data-in-R.Rproj

  3. This should open RStudio - with the Files panel displaying the workshop folder contents.

  4. Double-click on the file Census-Data-in-R.Rmd to follow along!

You can also click on the file Census-Data-in-R.html in the Files tab to open the workshop tutorial in a web brower.

Install packages

If you installed any of these packages awhile ago, (especially tidycensus), it’s a good idea to install updates when you can (though not during the workshop as things can break!).

# Uncomment this to install packages, if necessary.
# install.packages(c("here", "tidyverse", "sf", "leaflet", "mapview", "tigris", "tidycensus"))

library(here)
## here() starts at /Users/pattyf/Documents/Dlab/workshops/AY2022/Census-Data-in-R-Sp22
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.1.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(sf)
## Warning: package 'sf' was built under R version 4.0.5
## Linking to GEOS 3.9.1, GDAL 3.4.0, PROJ 8.1.1; sf_use_s2() is TRUE
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.0.5
library(mapview)
## Warning: multiple methods tables found for 'crop'
## Warning: multiple methods tables found for 'extend'
library(tigris)
## Warning: package 'tigris' was built under R version 4.0.5
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
library(tidycensus)
## Warning: package 'tidycensus' was built under R version 4.0.5
## 
## Attaching package: 'tidycensus'
## The following object is masked from 'package:tigris':
## 
##     fips_codes

These seven libraries should be loaded in your environment now.

# If you run this chunk, output from the "here" function should be visible below. This is your local directory path. We can use this to import files later on. 
here()
## [1] "/Users/pattyf/Documents/Dlab/workshops/AY2022/Census-Data-in-R-Sp22"

Census API Key

You need a Census API key to programmatically fetch census data.

For more info on all available Census APIs see: https://www.census.gov/data/developers/data-sets.html

Add Your Census API Key

To use your Census API Key in R

  1. Copy and paste your Census API key from your email

  2. Use the tidycensus function census_api_key to register your API key with tidycensus. Don’t forget to put quotes around the key!.

# Install your census api key - long alphanumeric string
census_api_key("THE_BIG_LONG_ALPHANUMERIC_API_KEY_YOU_GOT_FROM_CENSUS")

Another way to add your Census API Key:

I keep my key in a file so no one can see it. One way to do this is by making a script that creates a variable key, and then using the source function to add that script as an object into your coding environment. The code chunk below is an example of how you might do that:

# source (run) an r script that creates a variable with my key
#source("/Users/pattyf/Documents/Dlab/workshops/keys/census_api_key.R")

#print(my_census_api_key) 

# register the key
census_api_key(key = my_census_api_key)

Decennial Census Data

The get_decennial function

We start by fetching total population from the 2010 Census with tidycensus’s get_decennial function. Let’s first talk about the code.

pop2010 <- get_decennial(geography = "state",   # census tabulation unit
                         variables = "P001001", # variable(s) of interest
                         year = 2010)           # census year
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
head(pop2010)          
## # A tibble: 6 x 4
##   GEOID NAME       variable    value
##   <chr> <chr>      <chr>       <dbl>
## 1 01    Alabama    P001001   4779736
## 2 02    Alaska     P001001    710231
## 3 04    Arizona    P001001   6392017
## 4 05    Arkansas   P001001   2915918
## 5 06    California P001001  37253956
## 6 22    Louisiana  P001001   4533372

Fetching data for more than one Census variable

We can pass a vector of census identifiers to the variables function argument if we want to get data for more than one variable. Below we add P0002002 for population in urban areas.

pop2010 <- get_decennial(geography="state",
                         variables = c("P001001","P002002"), # variable(s) of interest
                         year = 2010)           # census year
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# take a look
head(pop2010)
## # A tibble: 6 x 4
##   GEOID NAME       variable    value
##   <chr> <chr>      <chr>       <dbl>
## 1 01    Alabama    P001001   4779736
## 2 02    Alaska     P001001    710231
## 3 04    Arizona    P001001   6392017
## 4 05    Arkansas   P001001   2915918
## 5 06    California P001001  37253956
## 6 22    Louisiana  P001001   4533372

We can see the data for both variables if we sort the output by county name.

# Sort dataframe by county names (the NAME column)
pop2010 %>% arrange(NAME) %>% head()
## # A tibble: 6 x 4
##   GEOID NAME    variable   value
##   <chr> <chr>   <chr>      <dbl>
## 1 01    Alabama P001001  4779736
## 2 01    Alabama P002002  2821804
## 3 02    Alaska  P001001   710231
## 4 02    Alaska  P002002   468893
## 5 04    Arizona P001001  6392017
## 6 04    Arizona P002002  5740659

tidycensus returns tidy data

By default, tidycensus returns data in a tidy, or long format that allows data for multiple variables to be contained within the variable and value columns.

This is in contrast to untidy, or wide data where each variable is in its own column.

tidycensus can return wide data if you can add the parameter output=wide to the function call.

# wide format
pop2010w <- get_decennial(geography = "state",   # census tabulation unit
                         variables = c("P001001","P002002"), # variable(s) of interest
                         year = 2010,           # census year
                         output="wide")         # get output in wide format
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
head(pop2010w) 
## # A tibble: 6 x 4
##   GEOID NAME        P001001  P002002
##   <chr> <chr>         <dbl>    <dbl>
## 1 01    Alabama     4779736  2821804
## 2 02    Alaska       710231   468893
## 3 04    Arizona     6392017  5740659
## 4 05    Arkansas    2915918  1637589
## 5 06    California 37253956 35373606
## 6 22    Louisiana   4533372  3317805

The GEOID column

The GEOID column is included in tidycensus output by default.

This is a Census geographic identifier for the tabulation unit.

The GEOID is sometimes called the Census FIPS code and for most tabulation units these are the same.

The GEOID makes it possible to link to Census demographic data to Census geographic data and make maps. We will do this in a bit.

The GEOID is a text string and must be quoted.

  • Beware of GEOID leading zeros, since some software will remove these and convert GEIOD values to numbers (rather than text strings).

Question: What is the GEOID for California?

Census Tabulation Units

Public census data is typically aggregated by census geographies to protect privacy.

These census geographies are called Census tabulation units.

  • Some of these are real administrative units like states and counties.
  • Others are statistical units created by the census, like census tracts and block groups.

Some of the most common geographic tabulation units and their tidycensus function abbreviations are shown below, along with required and available filters that limit what data are returned.

Geography Definition Filter(s) Used in tidycensus
“us” United States get_acs(), get_decennial()
“region” Census region get_acs(), get_decennial()
“state” State or equivalent state get_acs(), get_decennial()
“county” County or equivalent state, county get_acs(), get_decennial()
“place” Census place state get_acs(), get_decennial()
“tract” Census tract state, county get_acs(), get_decennial()
“block group” Census block group state, county get_acs(), get_decennial()
“block” Census block state, county get_decennial() only!

get_decennial Tabulation Units and Filters

Let’s take a few minutes to practice fetching population data with the get_decennial function.

  • See ?get_decennial for help

Challenge 1

Open Census-Data-in-R-Challenges.Rmd and use the get_decennial function like we’ve seen above to fetch population data. Solutions are available in the Solutions folder, as needed.

Changing the tabulation unit

Let’s fetch 2010 population data for CA counties

What changes in the code?

get_decennial(geography = "county",              # census tabulation unit
                          variables = "P001001", # variable(s) of interest
                          year = 2010,           # census year
                          state='CA')            # Filter by state is CA
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
## # A tibble: 58 x 4
##    GEOID NAME                            variable   value
##    <chr> <chr>                           <chr>      <dbl>
##  1 06011 Colusa County, California       P001001    21419
##  2 06007 Butte County, California        P001001   220000
##  3 06001 Alameda County, California      P001001  1510271
##  4 06003 Alpine County, California       P001001     1175
##  5 06005 Amador County, California       P001001    38091
##  6 06009 Calaveras County, California    P001001    45578
##  7 06013 Contra Costa County, California P001001  1049025
##  8 06015 Del Norte County, California    P001001    28610
##  9 06031 Kings County, California        P001001   152982
## 10 06021 Glenn County, California        P001001    28122
## # … with 48 more rows

Questions

  • How do we specify the state of CA above? How else can we?
  • Can you fetch population data for all counties in the USA or do you need to have a state= filter?

Adding a county filter

You can also filter tidycensus results by county

get_decennial(geography = "county",              # census tabulation unit
                          variables = "P001001", # variable(s) of interest
                          year = 2010,           # census year
                          state='CA',            # Filter by state is CA
                          county='Alameda')      # Filter by county Alameda
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
## # A tibble: 1 x 4
##   GEOID NAME                       variable   value
##   <chr> <chr>                      <chr>      <dbl>
## 1 06001 Alameda County, California P001001  1510271

Challenge 2

In Census-Data-in-R-Challenges.Rmd, alter the code above to fetch 2010 population for Alameda & San Francisco Counties. Then try Challenge 2B & 2C.

Visualizing Results

We can visualize data to get a quick overview of the distribution of the values.

It’s a first step in exploratory data analysis and a last step in data communication.

ggplot2 is the most commonly used R package for data visualization.

  • It is loaded when you load the tidyverse package.

Let’s use it to visualize the population data.

Plot 2010 Population by state

Use ggplot2 to create an ordered horizontal bar chart.

# create a plot.
pop_plot <- ggplot(data=pop2010, 
                   # set aesthetic variables
                   aes(x=value/1000000, y=reorder(NAME,value)) ) + 
                   # pick geometry
                   geom_bar(stat="identity") +  
                   # add theme and titles. 
                   theme_minimal() + 
                   labs(title = "2010 US Population by State") +
                   xlab("Population (in Millions)") +
                   ylab("State")

# display the plot.
pop_plot

Developing your ggplot2 knowledge can really enhance your data analysis skills.

In combination with tidycensus it creates a powerful, reproducible data science workflow.

Identifying Census Variables

In the code above we fetched data for total population in 2010 using the variable "P001001".

That is not an obvious variable name, so how do we get those census data identifiers?

We can use the tidycensus load_variables function for this.

load_variables function

For any census dataset like the decennial census or the ACS 1 or 5-year estimates, use the load_variables function to fetch all available variables and identifiers.

Since these datasets have many, many variables, save the resulting dataframe to a variable and cache it locally so you do not need to repeatedly retrieve it over the web.

vars2010 <- load_variables(year=2010,        # Year or end year for ACS-5yr
                           dataset = 'sf1',  # 'sf1' for decennial census
                           cache = TRUE)     # Save fetched data locally

# How large is the output
dim(vars2010)
## [1] 8959    3
# Take a look with head or View
head(vars2010)
## # A tibble: 6 x 3
##   name    label                                concept        
##   <chr>   <chr>                                <chr>          
## 1 H001001 Total                                HOUSING UNITS  
## 2 H002001 Total                                URBAN AND RURAL
## 3 H002002 Total!!Urban                         URBAN AND RURAL
## 4 H002003 Total!!Urban!!Inside urbanized areas URBAN AND RURAL
## 5 H002004 Total!!Urban!!Inside urban clusters  URBAN AND RURAL
## 6 H002005 Total!!Rural                         URBAN AND RURAL

2010 Decennial Census Tables & Variables

  • Over 3,000 unique variables that describe population and housing characteristics

  • Organized in 333 Tables

    • 177 population tables (identified with a ‘’P’’) available to the block level
    • 58 housing tables (identified with an ‘’H’’) available to the block level
    • 82 population tables (identified with a ‘’PCT’’) available to the census tract level
    • 4 housing tables (identified with an “HCT”) available to the census tract level
    • 10 population tables (identified with a “PCO”) available to the county level
    • plus 2 additional PCT tables

See: https://www.census.gov/data/datasets/2010/dec/summary-file-1.html

What Variable Has the 2010 Total Population value?

We know this from our previous code blocks, but let’s find it for practice navigating the dataframe.

  • Let’s sort and filter the vars2010 dataframe to find it.

Questions:

What 2010 decennial census variable contains…

  • Median Age

  • Average Family Size

  • Number of occupied housing units

*See Census-Data-in-R-Solutions.Rmd if needed (under Variable Questions)

Challenge 3

Return to Census-Data-in-R-Challenges.Rmd and use the get_decennial function to fetch and plot an Avg Family Sizevaraible by CA County in 2010, and name the call as a dataframe, ca_fam_size. Once you’ve done that, plot the dataframe with the ggplotcall below. Hint: “P037001”

Challenge 4

Repeat the previous challenge with data from the 2000 decennial census. Don’t assume variable names are the same across the 2000 and 2010 census

Use load_variables to check the variable name!

Census Tract Data

Census tracts are the most commonly used census tabulation unit.

Let’s fetch population data for the census tabulation unit to tract

Because of the large number of census tracts, you MUST specify a state when requesting these data with tidycensus.

## Fetch population by **tract** for California.
ca_tract_pop2010 <- get_decennial(geography = "tract",   # census tab unit
                                   variables = "P001001", # var of interest
                                   year = 2010,           # census year
                                   state='CA')      # State filter
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# How many tracts in CA
dim(ca_tract_pop2010)
## [1] 8057    4
# take a look
head(ca_tract_pop2010)
## # A tibble: 6 x 4
##   GEOID       NAME                                                variable value
##   <chr>       <chr>                                               <chr>    <dbl>
## 1 06037434004 Census Tract 4340.04, Los Angeles County, Californ… P001001   2796
## 2 06037460000 Census Tract 4600, Los Angeles County, California   P001001   4851
## 3 06037460200 Census Tract 4602, Los Angeles County, California   P001001   5315
## 4 06037460301 Census Tract 4603.01, Los Angeles County, Californ… P001001   4638
## 5 06037460302 Census Tract 4603.02, Los Angeles County, Californ… P001001   4442
## 6 06037460401 Census Tract 4604.01, Los Angeles County, Californ… P001001    878

Fetching Census Tract Data

Census tract data can be quite large!

Fortunately, you can also limit the results to one or more counties.

tract_pop2010 <- get_decennial(geography = "tract",   # census tabulation unit
                         variables = "P001001",       # variable of interest
                         year = 2010,                 # census year - only one!
                         state="CA",                  # limit to California
                         county=c("Alameda","Contra Costa"))  # & counties
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
dim(tract_pop2010)
## [1] 569   4

Customizing tidycensus output

What two things are new here?

#urban and rural pop for 3 CA counties
ur_pop10 <- get_decennial(geography = "county",  # census tabulation unit
                           variables = c(urban="P002002",rural="P002005"),
                           year = 2010, 
                           summary_var = "P002001",  # The denominator
                           state='CA',
                           county=c("Napa","Sonoma","Mendocino"))
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

When fetching census data…

We have already specified more than one variable:

variables = c("P002002","P002005")
  1. You can also rename the values in the output ‘variable’ column.
variables = c(urban="P002002",rural="P002005")
  1. You can identify a summary_var (a denominator - here, the total count of all people or households surveyed. Can be used for calculations like percent of total.)
summary_var = "P002001"

Now let’s take a look at the resultant dataframe

# take a look at the results
ur_pop10
## # A tibble: 6 x 5
##   GEOID NAME                         variable  value summary_value
##   <chr> <chr>                        <chr>     <dbl>         <dbl>
## 1 06045 Mendocino County, California urban     48110         87841
## 2 06055 Napa County, California      urban    118194        136484
## 3 06097 Sonoma County, California    urban    424102        483878
## 4 06045 Mendocino County, California rural     39731         87841
## 5 06055 Napa County, California      rural     18290        136484
## 6 06097 Sonoma County, California    rural     59776        483878

Calculating Percents

The summary_value column comes in handy when you want to compute percent of total, for example:

# Calculate the percent of population that is Urban or Rural
ur_pop10 <- ur_pop10 %>%
            mutate(pct = 100 * (value / summary_value))

# Take a look at the output.
ur_pop10 
## # A tibble: 6 x 6
##   GEOID NAME                         variable  value summary_value   pct
##   <chr> <chr>                        <chr>     <dbl>         <dbl> <dbl>
## 1 06045 Mendocino County, California urban     48110         87841  54.8
## 2 06055 Napa County, California      urban    118194        136484  86.6
## 3 06097 Sonoma County, California    urban    424102        483878  87.6
## 4 06045 Mendocino County, California rural     39731         87841  45.2
## 5 06055 Napa County, California      rural     18290        136484  13.4
## 6 06097 Sonoma County, California    rural     59776        483878  12.4

A plot gives us compact visual summaries of the data.

## Plot it with ggplot2
myplot <- ggplot(data = ur_pop10, 
          mapping = aes(x = NAME, fill = variable, 
                     y = ifelse(test = variable == "urban", 
                                yes = -pct, no = pct))) +
          geom_bar(stat = "identity") +
          scale_y_continuous(labels = abs, limits=c(-100,100)) +
          labs(title="Urban & Rural Population in Wine Country", 
               x="County", y = " Percent of Population", fill="") +
          coord_flip()

myplot

Don’t worry if you don’t get all the ggplot code now. It’s here for reference.

  • You may want to check out D-Lab’s R Data Visualization with ggplot workshop!

Data Wrangling to Combine Data from 2 Censuses

You can use your R skills to reformat the data and make it more usable.

Let’s fetch population data for 2010 and 2000 by state.

Then we will combine these into one data frame using the tidyverses::bind_rows function

# Fetch 2000 population data by state
pop2000 <- get_decennial(geography = "state",
                         variables = c(pop2000="P001001"), 
                         year = 2000)
## Getting data from the 2000 decennial Census
## Using Census Summary File 1
# Fetch 2010 population data by state
pop2010 <- get_decennial(geography = "state",
                         variables = c(pop2010="P001001"), 
                         year = 2010)
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# Use tidyverse `bind_rows` function to combine the data for these years
state_pop <- bind_rows(pop2000, pop2010)

# Take a look with head or View
state_pop %>% arrange(NAME) %>% head(10)
## # A tibble: 10 x 4
##    GEOID NAME       variable    value
##    <chr> <chr>      <chr>       <dbl>
##  1 01    Alabama    pop2000   4447100
##  2 01    Alabama    pop2010   4779736
##  3 02    Alaska     pop2000    626932
##  4 02    Alaska     pop2010    710231
##  5 04    Arizona    pop2000   5130632
##  6 04    Arizona    pop2010   6392017
##  7 05    Arkansas   pop2000   2673400
##  8 05    Arkansas   pop2010   2915918
##  9 06    California pop2000  33871648
## 10 06    California pop2010  37253956

Saving tidycensus output

The data we fetch using tidycensus is stored in an R dataframe.

We can use the write.csv or write_csv function to save the contents of a dataframe to a CSV file.

write_csv(state_pop, here('data_out/state_pop_2010.csv') )

Any Questions?

Mapping Census Data

You can fetch census geographic data by adding the parameter geometry=TRUE to tidycensus functions

  • Under the hood, tidycensus calls the tigris package to fetch data from the Census Geographic Data APIs.

You can then use your favorite R mapping packages like sf, ggplot, tmap, mapview and leaflet to make maps.

Geometry Options

Before fetching census geographic data, we need to set the option tigris_use_cache to TRUE

Caching saves data locally. This greatly speeds things up if you fetch the same census geographic data repeatedly.

# Tigris options - used by tidycensus
# Cache retrieved geographic data locally
options(tigris_use_cache = TRUE)  

Fetch Geographic Boundary Data with tidycensus

We fetch the census geographic data by setting geometry=TRUE.

pop2010geo <- get_decennial(geography = "state", 
                          variables = c(pop10="P001001"), 
                          year = 2010, 
                          output="wide", 
                          geometry=TRUE) # Fetch geometry data for mapping
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

Take a look

Let’s take a minute to discuss the format of an sf spatial object.

head(pop2010geo, 3)
## Simple feature collection with 3 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -90.41814 ymin: 41.23796 xmax: -66.9499 ymax: 48.19097
## Geodetic CRS:  NAD83
## # A tibble: 3 x 4
##   GEOID NAME        pop10                                               geometry
##   <chr> <chr>       <dbl>                                     <MULTIPOLYGON [°]>
## 1 23    Maine      1.33e6 (((-67.61976 44.51975, -67.61541 44.52197, -67.58774 …
## 2 25    Massachus… 6.55e6 (((-70.83204 41.6065, -70.82373 41.59857, -70.82092 4…
## 3 26    Michigan   9.88e6 (((-88.68443 48.11578, -88.67563 48.12044, -88.67639 …

Geospatial Data in R

The tidycensus package uses the R sf package to manage geospatial data.

R sf objects include:

  • a dataframe with a geometry column labeled geometry

    • The geometry can be of type POINT, LINE, POLYGON
    • or, MULTIPOINT, MULTILINE or MULTIPOLGYON
  • a CRS (coordinate reference system), specified by

    • epsg(SRID) code
    • proj4string

For a deeper understanding of the sf package and its functionality, we recommend

Census Data Coordinate Reference System (CRS)

All geospatial data are referenced to the surface of the earth with a CRS, or coordinate reference system. Anyone working with geospatial data will need to develop an understanding of CRSs.

Fortunately, many of us are familiar with longitude and latitude, which are geographic coordinates. But there are different versions of geographic CRSs. And there are also projected CRSs which transform longitude and latitude to 2 dimensional surface for mapping & analysis.

All census geographic data use the NAD83 geographic CRS. NAD83 stands for North American Datum of 1983. This CRS (or version of latitude and longitude) is best for locations in North America.

Many geospatial operations require you transform data to a common CRS before conducting spatial analysis or mapping.

  • This could be an issue if you try to combine the census geospatial data with other geospatial data. But it is not an issue in this tutorial.

An in-depth discussion of CRSs is outside the scope of this workshop. See Geocomputation in R for more information.

Mapping sf Spatial Objects

We can use sf::plot to make a quick map the geometry stored in an sf spatial object.

# plot the geometry column data
plot(pop2010geo$geometry)

The Challenge of US maps

The vast geographic extent and non-contiguous nature of the USA makes it difficult to map.

Fortunately, tidycensus includes a shift_geo parameter to shift AK & HI to below Texas.

pop2010geo_shifted <- get_decennial(geography = "state", 
                                    variables = c(pop10="P001001"), 
                                    output="wide",
                                    year = 2010, 
                                    geometry=TRUE, 
                                    shift_geo=TRUE)
## Warning: The `shift_geo` argument is deprecated and will be removed in a future
## release. We recommend using `tigris::shift_geometry()` instead.
## Getting data from the 2010 decennial Census
## Using feature geometry obtained from the albersusa package
## Using Census Summary File 1
## Please note: Alaska and Hawaii are being shifted and are not to scale.
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## Shift Happens!
plot(pop2010geo_shifted$geometry)

Saving Spatial Objects

You can save any sf spatial data object to an ESRI shapefile using st_write

st_write(pop2010geo_shifted, here("data_out/usa_pop2010_shifted.shp"))

Now take a look at the output shapefile.

# Check to see if the data was written out to a shapefile
dir(here("data_out")) 

ESRI Shapefiles

You can see from this output that an ESRI shapefile is actually a collection of files that all have the same prefix.

Shapefiles are the most common file format for geospatial data. So it’s worthwhile to learn more about them if you will be working with census geographic data.

Mapping Data Values

You can use the sf plot command to make a map that sets the color of the geometry by the data values

  • This type of map is called a thematic map.

  • When the features being plotted are areas (or polygons), it’s called a choropleth map!

# Name the column with the variable values to make
# a thematic map, also called a choropleth map.
plot(pop2010geo_shifted['pop10'])  

ggplot2 Map

ggplot knows how to map sf objects!

ggplot(pop2010geo_shifted, aes(fill = pop10)) + 
  geom_sf()  # tells ggplot that geographic data are being plotted

If you are familiar with sf objects and ggplot you can further customize your maps.

ggplot(pop2010geo_shifted, aes(fill = pop10)) + 
  geom_sf(color=NA) + # What does color=NA do
  coord_sf(crs = 3857) + # Dynamically change the CRS
  scale_fill_viridis_c(option = "viridis")  # Change the color palette

                                            # Try different options, e.g.
                                            # plasma, magma, inferno, cividis

Challenge 5

In your Census-Data-in-R-Challenges.Rmd file, create a map of Median Age by California County in 2010. Solutions are in the Census-Data-in-R-Solutions.Rmd file

Fetch Census Data and Geometry for Multiple States or Counties

We can fetch Census data and geometry for more than one state or county with same function call.

  • This is so much easier than any alternative approach!

  • It can be applied to any available geographic tabulation areas (eg states, counties, tracts, places).

Let’s try it with Census Tracts!

Fetch tract population and geometry data for Bay Area Counties.

bay_counties <- c("Alameda", "Contra Costa", "Marin", "San Francisco",
                  "Sonoma", "Napa","Solano", "San Mateo", "Santa Clara")

bayarea_pop10 <- get_decennial(geography = "tract", 
                      variables = "P001001", 
                      year = 2010, 
                      state='CA',
                      county=bay_counties,
                      geometry=T)
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# Quick map
plot(bayarea_pop10['value'])

Any Questions?

Fetching ACS data with get_acs

  • ACS data contains the most recent information about the American population.

  • We can use the tidycensus function get_acs to retrieve ACS data using code very similar to get_decennial.

BUT the workflow is more complex because:

  1. ACS data has a lot more variables, and

  2. ACS data are sample data, so each ACS variable that you retrieve with tidycensus will fetch both an estimate of the value and a margin of error.

ACS Data Products

The ACS has two primary data products - the ACS 1 year estimates and the 5 year estimates.

  • The ACS 1 year estimates are more current but hasve a larger margin of error and is not available for Census geographies with a population of < 65,000.

  • The ACS 5-year estimates are more stable but represent a larger time period.

  • The ACS 3 year estimates has been discontinued.

Fetch metadata on ACS 5-Year Variables

Let’s use the load_variables function to get a dataframe of all variables from the ACS 2016—2020 5-year dataset.

  • Note: we change the dataset value to acs5 where before we used sf1 to fetch info on the decennial census variables.
vars_acs2020 <- load_variables(year=2020,       # end year 2016-2020 period
                              dataset = 'acs5', # the ACS data product
                              cache = T)        # Save locally for future use

# how many variables?
dim(vars_acs2020)
## [1] 27850     3

Exploring the ACS Variables

View the vars_acs2020 dataframe to find the variable name for median household income.

#View(vars_acs2020)

Question Is the variable name for total population in the ACS 5 year 2020 data the same as it is in the 2010 census data?

Fetch ACS Data on Median Household Income

Let’s fetch the median household income data for San Francisco County by census tract.

med_hhincome <- get_acs(geography='tract',
                        variables="B19013_001",
                        year = 2020,
                        state='CA',
                        county='San Francisco',
                        geometry=TRUE # get the geography too
                        )
## Getting data from the 2016-2020 5-year ACS

Take a look at the output

head(med_hhincome)
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.5099 ymin: 37.70826 xmax: -122.4022 ymax: 37.79016
## Geodetic CRS:  NAD83
##         GEOID                                                  NAME   variable
## 1 06075035202 Census Tract 352.02, San Francisco County, California B19013_001
## 2 06075040100    Census Tract 401, San Francisco County, California B19013_001
## 3 06075047702 Census Tract 477.02, San Francisco County, California B19013_001
## 4 06075026403 Census Tract 264.03, San Francisco County, California B19013_001
## 5 06075012100    Census Tract 121, San Francisco County, California B19013_001
## 6 06075020300    Census Tract 203, San Francisco County, California B19013_001
##   estimate   moe                       geometry
## 1    99408 19727 MULTIPOLYGON (((-122.5099 3...
## 2   130625 21973 MULTIPOLYGON (((-122.4648 3...
## 3   125156 33460 MULTIPOLYGON (((-122.4887 3...
## 4    72837 18333 MULTIPOLYGON (((-122.4113 3...
## 5    75316  7363 MULTIPOLYGON (((-122.4154 3...
## 6   162059 43630 MULTIPOLYGON (((-122.4352 3...

ACS Output

The census data returned by the get_acs function is a bit different from that returned by get_decennial.

  • What is the name of the variable containing the income data?

  • What is the name if we set output="wide"?

med_hhincome_wide <- get_acs(geography='tract',
                        variables="B19013_001",
                        year = 2020,
                        state='CA',
                        county='San Francisco',
                        geometry=TRUE, # get the geography too
                        output="wide"
                        )
## Getting data from the 2016-2020 5-year ACS
# uncomment and run to view
# head(med_hhincome_wide)

Map Median Household Income by tract

Use sf::plot to create a map of median household income in San Francisco.

plot(med_hhincome['estimate'])

What do you think of that map?

It’s odd because San Francisco County is not the same as the city of San Francisco and what we want to map is the city.

Create a map with ggplot

We can use ggplot to zoom in on the city by setting the x axis limits to a narrower geographic range.

ggplot(med_hhincome, aes(fill = estimate)) + 
  geom_sf() +
  xlim(-122.55, -122.3)

Question

Why do you think we have NA values in the ACS estimates?

Fetching Multiple ACS-5 Variables

We can drill down into the ACS data by fetching data for subgroups, where available.

Let’s fetch median household income by race.

First identify the variables of interest.

# Median household income by race/ethnicity: Variables from ACS 2015—19
# All households =   "B19013_001",
inc_by_race <- c(White = "B19013H_001",
                 Black = "B19013B_001",
                 Asian = "B19013D_001",
                 Hispanic = "B19013I_001" )

Fetch census tract data for multiple variables at once.

# Fetch the Data
alco_mhhinc_by_race <- get_acs(geography='tract',
                                  variables=inc_by_race,
                                  year = 2019,
                                  state='CA',
                                  county='Alameda',
                                  geometry=T )
## Getting data from the 2015-2019 5-year ACS

Facet Mapping

Facet maps are a way to create visualizations of small multiples, or subsets of the data in order to facilitate comparisons. Here, we use ggplot’s facet_wrap function to make multiple maps of median household income by race for Alameda County.

# Create the map
medhhinc_facet_map <- alco_mhhinc_by_race %>%
                        ggplot(aes(fill = estimate)) +
                          facet_wrap(~variable) +
                          geom_sf(color=NA) +   # why color=NA?
                          scale_fill_viridis_c(option="magma")

# Display the map
medhhinc_facet_map

Challenge 6

In Census-Data-in-R-Challenges.Rmd file, Make a ggplot map of MEDIAN GROSS RENT in San Francisco County by tract using data from the ACS 2016—2020 5-year product. Check Census-Data-in-R-Solutions.Rmd for answers, as needed.

Any Questions?

Interactive Mapping

Interactive mapping gives the RStudio environment some of the functionality of desktop GIS.

There are a number of R packages that you can use, including:

  • mapview: quick interactive exploratory data viewing

  • tmap: great static and interactive maps

  • Leaflet: highly customizable interactive maps

All of these are based on the Leaflet Javascript Library.

mapview

Let’s use mapview to make quick interactive maps of the median hhousehold income data

mapview(med_hhincome)

When passed the name of an sf object and no other options, mapview will:

  • display the geometry using a single color for the fill and for the stroke

  • display the feature ID on hover

  • display the data from the dataframe on click

Mapview Thematic Maps

The zcol argument will take a column name and color the features by the values in that column.

mapview(med_hhincome, zcol="estimate")

Challenge 7

In the Census-Data-in-R-Challenges.Rmd, use mapview to create an interactive map of median household rent.

Determining what ACS Variables to use

ACS variables can be confusing.

Some ways to identify the best variables to explore:

  • Web search, especially Census web resources, can help.

  • The Census Reporter website (https://censusreporter.org) provides another tool for navigating topics, tables, and variable names.

  • The NHGIS website (nhgis.org) is a great way to browse variables of interest.

ACS Margins of Error (MOE)

We haven’t talked about it but it may be important in your work with ACS data.

Math is needed to combine MOEs when you combine variables.

  • tidycensus includes some nice functions for these calculations and a good overview of the topic.

Summary

tidycensus offers two key functions for fetching census tabular and geographic: get_acs and get_decennial.

  • The load_variables function helps identify the names of census variables of interest.

Support for fetching population estimates and migration flow census data was recently added to tidycensus. You can read up on it on the tidycensus documentation website

Using tidycensus to fetch the tabular data or both tabular and geographic data is IMO way easier than any alternatives, IF you (1) know R, (2) know a bit about working with geographic data in R.

This approach is also scaleable if you want multiple census variables for various locations and tabulation areas.

You can make publication or report ready maps with highly customizable ggplot2 code or use the sf::plot command to make quick maps.

Interactive mapping greatly enhances your ability to do exploratory data analysis in RStudio.

References

Much of this tutorial is based on resources by Kyle Walker, author of tidycensus. See:

Related D-Lab Workshops

Great online resource for working with spatial data in R